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0\ . Abstract 

Basic uniform pseudo-random number generators are implemented on ATI 
' Graphics Processing Units (GPU). The performance results of the realized 

generators (multiplicative linear congruential (GGL), XOR-shift (XOR128), 
p . RANECU, RANMAR, RANLUX and Mersenne Twister (MT19937)) on CPU 

^ ' and GPU are discussed. The obtained speed-up factor is hundreds of times in 

<~i , comparison with CPU. RANLUX generator is found to be the most appropriate 

for using on GPU in Monte Carlo simulations. The brief review of the pseudo- 
random number generators used in modern software packages for Monte Carlo 
^ , simulations in high-energy physics is present. 

Keywords: Monte Carlo simulations, GPGPU, pseudo-random number genera- 
tors, performance 



g ; 1 Introduction 

' The development of the General Purpose computing on Graphics Processing Unit 

{GPGPU) technology has opened recently a new cheap alternative to supercomput- 
ers and cluster systems for researchers. The primary role of the GPGPU technol- 
, ogy promotion justly belongs to nVidia company which is the one of the two main 

^ ' GPU hardware developers. nVidia motto is "One Researcher, One Supercomputer". 

It fully reflects the tendencies of the computational systems present-day market. 
Moreover, the systems which providing the GPGPU technology become the inte- 
gral parts of the contemporary supercomputers. In particular, while assembling 
new supercomputer TH-l with peak productivity 1.206 petaflops which is equipped 
with the ATI GPUs HD 4870X2 on every node, China became the third country to 
build a petaflop supercomputer (after USA and Germany) [T]. 

Pseudo-random numbers generation algorithms are key components in most 
modern packages for researches and result depends on their characteristics. The 
package should provide the actual investigation of the physics or the other simu- 
lated problems, but not to explore the behavior of the generator itself. Thus, while 
choosing generator, one should consider not only its performance productivity but 
also its statistical properties. Checking the key results with generator of a differ- 
ent class than used one is also very important [5]. "Pseudo-random" term actually 
means that for such values always exist an algorithm which may reproduce the whole 
sequence. The values, produced by some physical process, cannot be considered as 
random in a general sense, because even if we cannot predict the sequence of such 
numbers, it does not mean that there is no algorithm to produce them [5]. 
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The main aim of the present paper is to show the possibihty to adapt for GPU 
the most commonly used pseudo-random number generators (PRNG) as the main 
components of the software packages for Monte Carlo {M(Tj simulations. An essen- 
tial distinction of the GPU architecture from the Central Processing Unit (CPU) 
architecture causes new difficulties as well as new optimization capabilities. The 
question about the performances of the different PRNGs on GPU platform is arisen 
in this connection. The performance results and the difference between PRNGs 
performances on CPU and GPU will be shown in the paper. 

We will restrict our investigation only to uniform PRNGs. In particular some 
generators were already investigated for nVidia GPUs in [U [5] . But studied gener- 
ators are not the most frequently used ones in MC simulations. 

In the present paper we will not refer the question of the random-number gener- 
ator testing (see for example [B] and references therein) and will content only with 
implementation of existing generators on GPU. 

The paper is organized as follows. Section[5]contains the brief list of the software 
packages for MC simulations in high-energy physics with showing PRNGs which are 
used in the corresponding package. The details of the GPU implementation and the 
code listings of the realized PRNGs are described in section [3l The performance 
results and discussion are collected in section 01 In section [5] we draw our con- 
clusions. Finally, appendix includes some theoretical background for the realized 
PRNG algorithms. 

2 PRNGs in Monte Carlo simulations 

In this section we present the list of PRNGs which are employed in nowadays MC 
simulations. 

To choose a generator which will be used in MC simulations, it is not enough to 
consider only its period and algorithm output. The generator R250 was widely used 
due to its simplicity and long period, but it was found that this generator has the 
essential statistical defects which make impossible to use it in modern simulations 
(see [5] and reference therein). Apart from general statistical tests like DIEHARD 
Battery of Tests of Randomness and Crush, a generator has to pass empirical test 
in real conditions. That is why, while developing MC application, usually only gen- 
erators with minimal statistical influence for the present MC simulations are used. 
For example, European Organization for Nuclear Research (CERN) recommends 
using three PRNGs: RANMAR (V113), RANECU (V114) and RANLUX (V115). 

Below is the incomplete list of the software packages for MC simulations in 
high-energy physics and the PRNG which is used in corresponding package. 

• FermiQCD: is an open C++ library for development of parallel Lattice 
Quantum Field Theory computations |32[ 133) . It uses floating-point version 
of RANMAR generator as defauh PRNG. 

• UKQCD: By indirect information UKQCD CoUaboration also uses RANMAR 
PRNG [Ml in its software. 

• MILC: is an open code of high performance research software written in C 
for doing SU{3) lattice gauge theory simulations on several different (MIMD) 
parallel computers [35]. MILC uses own XOR of a 127-bit feedback shift 
register and a 32 bit integer congruential generator. Each node or site uses a 
different multiplier in the congruential generator and different initial state in 
the shift register. So, all nodes are generating different sequences of numbers. 

t = (((X„-5 > 7) OR (X„_6 « 17)) XOR (1) 



2 



((X„_4 » 1) OR {Xn-5 « 23))) AND (2^4 - l), 

= Xn, Xn ~ Y„ ~ ajYn-1 + C, 

zn = {t XOR {{Yn » 8) AND (2^4 - l)))/224 

• CPS: The Columbia Physics System is a large set of codes for lattice QCD 
simulations [3^. RAN3 is used. 

• SZIN: is the open-source software system supports data-parallel programming 
constructs for lattice field theory and in particular lattice QCD [221. RANLUX 
is used in SZIN packet. 

• ISAJet: is a Monte Carlo event generator for pp, pp, e^e~ interactions [38[ 
[39] . RANLUX is incorporated into ISAJet. 

• GEANT4: is a toolkit for the simulation of the passage of particles through 
matter gD]. GEANT4 uses the HEPRandom module [JT] to generate 
pseudo-random numbers which includes 12 different random engines (RAN- 
MAR, RANECU, DRAND48, RANLUX, etc) now. 

• PYTHIA: is a program for the generation of high-energy physics events 
[His]. RANMAR is used in PYTHIA as an internal PRNG. 

• HERWIG: is a Monte Carlo package for simulating hadron emission reactions 
with interfering gluons |44[ I45| . RANECU is used. 

• CompHEP: a package for evaluation of Feynman diagrams, integration over 
multi-particle phase space and event generation [46II47| . DRAND48 is used in 
CompHEP. 

• MC@NLO: is a Fortran package to implement the scheme for combining a 
Monte Carlo event generator with Next-to-Leading- Order calculations of rates 
for QCD processes |15]|15]. GGL is used. 

• SHERPA: is a Monte Carlo event generator for the simulation of high- 
energy reactions of particles in lepton-lepton, lepton-photon, photon-photon 
and hadron-hadron collisions |50[ 151) . RANMAR is used. 

• Chroma: is a software system for lattice QCD calculations [S51 [S3]- In 
Chroma slightly modified linear congruential generator RAN NYU is imple- 
mented - LCG(a = 31167285,771 = 2^8). 

• GENIE: is a neutrino event generator for experimental physics community 
[MllIS]. MT19937 is used. 

• ALPGEN: is an event generator for hard multiparton processes in hadronic 
collisions [51 [57]. RANECU is used. 

So, the most commonly used generators are RANMAR, RANLUX, RANECU 
and several variations of a linear congruential generator (DRAND48, RAN3, GGL, 
RANYU). Also, most new packages began to include the generator MT19937 as in- 
ternal PRNG. Hence, it is reasonable to implement the generators on GPU which 
are used in real MC simulations. 
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Table 1: General information about some ATI's video cards |60| . 



Model 


Stream 
cores 


Core 
(MHz) 


Memory 
(MHz) 


Bandwidth 

(GB/s) 


Bus width 
(bit) 


Tflops 
(peak) 


HD 4850 


800 


625 


993* 


64 


256 


1.0 


HD 4870 


800 


750 


900 


115.2 


256 


1.2 


HD 4870X2 


2 X 800 


2 X 750 


900 


2 X 115.2 


2 X 256 


2.4 


HD 5850 


1440 


725 


1000 


128 


256 


2.1 


HD 5870 


1600 


850 


1200 


153.6 


256 


2.7 


HD 5970 


2 X 1600 


2 X 725 


1000 


2 X 128 


2 X 256 


4.6 



3 GPU implementation 

In this section we give the base ideas for PRNG implementation on GPU and 
describe the source codes of the following PRNGs realizations on ATI GPUs: GGL, 
XOR128, RANECU, RANMAR, RANLUX and MT19937. 

We use ATI Stream SDK [H] for the reahzation of the PRNGs on ATI GPU 
as software environment. ATI GAL allows using ATI GPU hardware in the most 
effective way [5Sj. That is why all PRNGs realizations presented below are made on 
ATI Intermediate Language (IL) [51], and not on higher level, for example OpenCL 
or Brook+. 

Three different ATI video cards are used to check the algorithm efficiency - ATI 
Radeon HD 5850, HD 4870 and HD 4850. The essential for GPGPU-apphcations 
parameters about some ATI's video cards are presented in Table [TJ All cards are 
equipped with the GDDR5 memory except HD 4850 which contains GDDR3 mem- 
ory (this fact was marked with the asterisk). GDDR5 memory possesses a quadruple 
effective data transfer rate relative to its physical clock rate, instead of double as 
with GDDR3 memory. It is necessary to note that HD 4870X2 and HD 5970 are 
two-core cards. For our purposes at program level they are equivalent to two devices 
installed in the system. 

3.1 General implementation scheme 

To design the GPU-applications it must be accounted the following main features 
of hardware architecture: 

• each general-purpose register and memory cell has the four 32-bit components 
that are designated as .x, .y, .z and .w; 

• floating point operations are more productive on GPU than integer operations 
(in compare with CPUs); 

• double precision floating point operations are the slowest on GPU; 

• ATI GPU can perform up to five operations on each VLIW processor simul- 
taneously. 

Computing programs working for GPU are known as kernels. All kernels are run 
by the host program. Each kernel must perform a complete operation, because of 
ATI Stream SDK does not support the execution of a kernel from another kernel 
(except OpenCL applications). 

We offer to use PRNGs directly in MC procedures as the separate subroutines. 
Undoubtedly such scheme allows the possibility to keep generated random numbers 
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PRNG 
bootstrap 

PRNG 
initialization 

generation 



finalization 



Lag table, 
indices, 
carry, 
etc. 



Figure 1: PRNG structure scheme 



in GPU-memory, for example, for further usage by other procedures, as it is usuahy 
made in CPU-simulations. However, it seems to be more effective to "virtualize" 
produced by PRNG pseudo-random numbers as it allows to eliminate unnecessary 
additional read-write operations to the GPU-memory for the random numbers as 
well as to decrease the GPU-memory consumption of the application. 

In order to accelerate GPU-applications work, we recommend to keep all data 
needed for kernel operations directly in GPU-memory, because memory operations 
are a bottleneck of GPU. 

For MC simulations on GPU it is convenient to produce four random numbers 
through one PRNG pass which is closely related to GPU memory architecture. 
Under such conditions one can use GPU performance in the most efficient way. As 
a rule more than one random number are used in MC simulations for updating (for 
example, one for update proposition and one for probability). So, it seems to be 
natural to generate the corresponding number of the pseudo-random numbers for 
further purposes. 

The common structure scheme of PRNGs is shown in Figure [TJ All implemented 
PRNGs are divided into 3 phases: 

• initialization: loading and preparing the previous state of PRNG and all 
preparing operations for PRNG; 

• random number generation: actually all PRNG operations which generate 
pseudo-random numbers; 

• finalization: storing the final state of PRNG for the next run. 

It allows to make the best use of PRNG, as it is needed a few more random numbers. 
For example, at multihit updating method the several pseudo-random numbers are 
required per one pass of the updating procedure. The initialization subroutine of 
the PRNG is called on the start of the updating procedure. During the work the 
updating procedure may call the PRNG generation subroutine many times. And 
finally, at the end it must call the finalization subroutine of the PRNG. 

All PRNG lag tables are stored directly in GPU-memory and it avoids needless 
transfer the data between CPU and GPU memory. Each instance of the PRNG uses 
its own lag table to parallelize the process. The universal method of the paralleling 
is used in all realized PRNGs - using the independent sequences. To be exact, every 
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call 11 

// random number in rZOQ 



func 10 // PRNG initialization 

dcl_literal 1200, 0, 0, %-PM_r%, %PM 
dcl_literal 1201, %PMSEED_MASK% , %PM 
and r201.K,vaTid.x, 1201.x 
raov r203,g [r201 .x+%iIPMSEED_START%] 
ret_dyn 
endf unc 



_FP% 

%, %PM I 



func 11 // PENG generation 

udiv r202,r203,1201.wwww 
uraod r204,r203,1201.wwww 
umul r205,r202, 1200. zzzz 
imad r206, r204 , 1201 . yyyy , r205 
lit r207,1200.xxxx,r206 
cmov^logical r208, r207, 1200 .xx 
iadd r203,r206,r208 
utof r200,r203 
div r200,r200,1200.wwww 
ret_dyn 



fu 



12 // PRNG finalization 

V g[r201.x+%iIPMSEED_START%] ,r203 
t_dyn 



Figure 2: Source code of GGL PRNG 



running instance of PRNG obtains its own index J = 0, 1, ... , Jmax- The lowest 
bits of J serve to identify the lag table for PRNG. Another possible scheme which 
may be used is the selection of the lag table through the modulo operation. 

We use a little bigger number of actual PRNG instances than the number of 
the stream cores in particular GPU to avoid possible collisions among threads. For 
example, top one-unit ATI GPU card HD 5870 has 1600 stream cores (see Table [T]), 
while 13 lowest bits were used to identify the lag table for PRNG. It corresponds 
to the 8192 parallel instances of the PRNG. 

Almost all generators require the initialization procedure to bootstrap the lag 
table (three of presented here generators, RANMAR, RANLUX and MT19937). This 
procedure is realized directly on the GPU unit, not with the help of the CPU 
with further transfer the lag table into GPU-memory. We do not show lag table 
initialization procedure here to save the space in the article. 

For convenience and performance purposes all the kernels have been precompiled 
to replace all constants %VariableName% with the corresponding values. It allows 
to avoid the constant buffer using and put the runtime constant parameters directly 
in assembler code. Under %VariableName% the value of the respecting variable will 
be implied in source codes below. Sometimes before the VariableName it will stay 
the prefix i (like %iVariableName%) which will show the decimal format instead of 
the default hexadecimal format. Decimal format is needed to specify the relative 
offsets in global memory operations. 

Note that each of the presented generators can be easily modified for the genera- 
tion of the pseudo-random numbers with double precision. All presented generators 
are either 24-bit. or 32-bit, while for number representation with double precision 64 
bits are required. Thus, for correct generation pseudo-random numbers with double 
precision one should use several numbers with single precision, but not only covert 
a number with the single precision into the double precision number by the regular 
conversion command (it considerably decreases the quantity of possible realization). 

3.2 GGL 

GGL is one of the simplest and computational "light" portable PRNG which could 
be implemented on GPU. GGL has been studied by Langdon on nVidia Tesla 
TlOP with CUBA SDK [5]. The obtained peak performance of the PRNG is 
about 2 X 10^ pseudo-random numbers per second. It is obvious that while period 
-Pggl < 2147483646 and its run in 1024-threads and output 10^ pseudo-random 
numbers per second the GGL period could be exhausted in about 0.002 second. So, 
GGL are realized here just to check the performance of the pseudo-random number 
production on ATI GPUs on a very simple generator which requires to store only 
one seed value per PRNG instance. 
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Park and Miller have published the four Pascal implementations of the GGL |12| . 
for integer and real arithmetic (one direct scheme and one avoiding the overflow). 
For implementation we chose the "Integer version 2" of the GGL [T^. The ATI IL 
source code of GGL PRNG is presented in Figure There are three subroutines 
10, 11 and 12 after the main module. Subroutines 10 and 12 are called only once 
and perform the initialization and the finalization of the PRNG, correspondingly. 
Subroutine 11 produces four-component pseudo-random numbers in register R200 
per one pass. This program scheme fully corresponds to the basic scheme, described 
in subsection 13.11 and hereinafter will be used for the rest of generators presented 
in this work. 

The used variables are: 

PM_m = 2147483647, PM_m_FP = 2147483647.0, 
-PM_r = -2836, PM_a = 16807, 
PM_q = 127773, 

PMSEED MASK specifies the PRNG instance and ilPMSEED START is the offset 
of the lag table in the global buffer which is prepared by the host program. 

The parallelization scheme with separate sequences is implemented here - every 
GPU thread produces four separate pseudo-random sequences (by every slot of gen- 
eral purpose register R200). So, the whole number of the pseudo-random sequences 
is the quadruple number of the threads to be run. The threads must be initialized 
carefully to reach the maximal period length of the PRNG and avoid the sequences 
overlapping. 

Generator requires to keep only one seed-value per thread for work which is 
equal to choosing only one four-component cell per thread in global memory which 
is read and kept only once per PRNG cycle. Thus, for 4096-thread run (or 16384 
subthreads) the size of the lag table will be 64kB. Initial seeds are filled with the 
host program and transferred to GPU memory before the first run of the GGL. Seed 
values must be exactly chosen to rich the maximal period of PRNG. But in our 
case the performance of the generator is the main aim, so all the seeds are selected 
randomly, because the generator exhausted whole the period for very short time 
and it is not so important when and where the overlapping of the sequences will 
began. 

The other LCGs could be easily realized on the given example of the PRNG 
by substituting the corresponding LCG parameters. GGL generator possesses very 
good performance, as it will be shown in the next section. Nevertheless, in spite 
of the PRNG performance, it would not be used for practical purposes due to very 
short period. 

3.3 XOR128 

Next PRNG implemented on ATI GPU is XOR128, a very fast generator with 
much better statistical properties and considerably longer period than GGL. The 
distinctive feature of this PRNG is the usage of the four-component 32-bit values to 
produce the sequence which exactly corresponds to bit-capacity of GPU memory. 

The ATI IL source code of XOR128 is shown in Figure[3] The following variables 
are used 

XR_L1 = 11, XR_R1 = 19, 
XR_R2 = 8, XR_m = 4294967295.0, 

XRSEED MASK specifies the PRNG instance, ilXRMSEED START is the offset of 
the lag table in global buffer which is prepared by host program. 
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cs_2_0 
all 10 



call 11 

// random number in rZOQ 



func 10 // XOR128 initialization 

dcl_literal 1200, %XRSEED_MASK%, %XR_L1%, %XR_R1%, %XR_R2% 
dcl_literal 1201, %XR_ra%, 0, 0, 
and r204.x,vaTid.x, 1200.x 
raov r201,g [r204 .x+%iIXRSEED_START%] 
ret_dyn 
endfunc 



func 11 // XOR128 generation 

Ishl r202,r201,1200.yYYY 
ixor r202,r202,r201 
ushr r203,r202,1200.wwww 
ixor r203,r202,r203 
ushr r201.x,r201.w,1200.z 
ixor r201.x,r201.x,r201.w 
ixor r201.x,r201.x,r203.x 
ushr r201._Y,r201.x,1200.z 
ixor r201._Y,r201.y,r201.x 
ixor r201._Y,r201.y,r203.Y 
ushr r201._z,r201.Y,1200.z 

ixor r201. z , r201 . z , r201 . y 

ixor r201. z , r201 . z , r203 . z 

ushr r201. w, r201 . z, 1200 . 

ixor r201._ ^w, r201 . w, r201 . 

r201. w,r201.w,r203. 

r200,r201 

r200,1201.xxxx 



utof 
dlv r200 
_dyn 



■ndfu 



unc 12 // XOR128 finalization 

mov g[r204.x+%iIXRSEED_START%] 
ret_dYn 



Figure 3: Source code of XOR128 PRNG 



The number of sequences produced by XOR128 corresponds to the number of 
threads. One thread generates four successive pseudo-random numbers from one 
sequence in every components of the general purpose register R200. Marsagha's 
algorithm |25| is slightly modified to parallelize sequence production into four sub- 
threads - four items of the lag table are composed in one four-component memory 
cell and are produced in one pass. Unfortunately, it is impossible to avoid re- 
cursion without making the algorithm more complicated. Therefore among the 
four-component operations in the source code there are single-slot operations which 
are partially parallelized further by the IL compiler. 

Except the final integer-to-float conversion operations in the algorithm, there are 
only bit shift and exclusive-OR operations which are "computationally light" GPU 
operations in compare with integer operations. Along with few memory operations 
(one read and one written operations per run) , it also increases the performance of 
PRNG on GPU. 

The XOR128 generator has a period large enough to be used on GPU. For 2048- 
threads run and average performance about 10^° samples per second it could be 
exhausted in about 10^^ years only. And only strong criticism of L'Ecuyer [26] does 
not to allow use XOR128 as standard PRNG on GPU. 

Generator requires keeping four 32-bit integers per thread. In the present real- 
ization PRNG produces four sequential pseudo-random numbers per thread which 
allows reserving only one 4-component cell per thread in global memory. This 4- 
component seed is read and written only once per PRNG cycle. The size of the 
XOR128 lag table is the same to GGL, i.e. for 4096-thread run it takes the 64kB of 
the GPU memory. 

3.4 RANECU 

Another high-performed PRNG reahzed on ATI GPU is RANECU. This generator is 
recommended by CERN and used in some software packages, listed in the previous 
section. Relatively long period and quite good statistical properties make RANECU 
reasonably attractive generator for small tasks. 

The scheme like in the case of GGL generator is implemented here: each thread 
produces the four independent sequences which are composed into the four compo- 
nents of the general purpose register R200. For RANECU run in 1024-threads and 
output 5 X 10^ pseudo-random numbers per second the RANECU period could be 
exhausted in about 31 hours. Despite this fact the generator is still acceptable for 



8 



call 11 

// random nuinber in c200 



func : 
del 



del 
del" 



dcl_ 
and 



// RanEcu initialization 

literal 1200, %RESEED_MASK% , %RESEEDP11%, 
%-RESEEDPll%, %-RESEEDP12% 

literal 1201, %RESEEDP13%, 0, %REICONS%, %RESEEDP21% 

"literal 1202, %-RESEEDP21%, %-RESEEDP22%, 
%RESEEDP23%, %REIC0NS2% 

literal 1203, 1, %REIC0NS3%, %RETW0M31%, 

"r201.x,vaTid.x, 1200.x 

r202,g [r201.x+%iIRESEED_START%] 

r203,g [r201.x+%iIRESEED_START2%] 

dyn 



func 11 // RanEcu generation 

udiv r204,r202,1200.Yyyy 
imad r205 , r204 , 120 . zzzz, r2 02 
imul r20e,r204,1200.wwww 
imad r205 , r205 , 120 1 . xxxx, r2 6 
ilt r207,r205,1201.yYyy 

cmov_logical r207 , r207 , 1201 . zzzz , 12 01 . yyyy 

iadd r202, r205, r207 

udiv r204,r203,1201.wwww 

imad r2 05 , r204 , 1202 .xxxx, r2 03 

imul r206,r204, 1202. yyyy 

imad r2 05 , r205 , 1202 . zzzz, r2 6 

ilt r207,r205, 1201. yyyy 

cmov_logical r207 , r207 , 1202 . wwww, 12 01 .yyyy 
iadd r203,r205,r207 
inegate r204,r203 
iadd r205,r202,r204 
ilt r207,r205, 1203. xxxx 

cmov_logical r207 , r207 , 1203 . yyyy, 12 01 . yyyy 

iadd r200,r205,r207 

utof r200,r200 

mul r200, r200, 1203 . zzzz 

endfunc 



func 12 // RanEcu f inalization 

mov g[r201.x+%iIRESEED_START%] , 
mov g[r201.x+%iIRESEED_START2%] 

endfunc 
end 



:202 
r203 



Figure 4: Source code of RANECU PRNG 



wide range of MC siraulation tasks. 

The ATI IL source code of the RANECU PRNG is presented in Figure S] The 
fohowing variables are used 

RESEEDPll = 53668, -RESEEDPll = -53668, 

-RESEEDP12 = -12211, RESEEDP13 = 40014, 

RESEEDP21 = 52774, -RESEEDP21 = -52774, 

-RESEEDP22 = -3791, RESEEDP23 = 40692, 

REICONS = 2147483563, REIC0NS2 = 2147483399, 

REIC0NS3 = 2147483562, RETW0M31 = 1.0/2147483648.0, 

ilRESEED START and ilRESEED_START2 are the offsets of two tables of seeds in 
global buffer, RESEED MASK specifies the PRNG instance. In fact, the RANECU 
lag table is divided into two tables for convenience here. These tables are prepared 
by the host program. 

The generator requires keeping two integer seed-values per thread which are 
grouped into two lag subtables. So for 4096-thread run the size of the lag table is 
128kB. 

The generator kernel is made on integer arithmetic base which slightly brings 
down the generator performance while using GPUs. However, simplicity of the 
algorithm and small amount of the lag table elements completely compensate this 
"drawback". On the base of the present code, one can easily construct another MRG 
by substitution of the corresponding parameters. 



3.5 RANMAR 

The next generator reahzed on ATI GPU is the 24-bit Marsaglia PRNG RANMAR. 
It was previously implemented on ATI GPU in [58] for Ising model and SU{2) 
gluodynamics simulations. 

In contrast to previously presented PRNGs, RANMAR has a larger lag table 
which contains 97 elements. All these lag table items must be prepared before 
the first working pass of the generator. Of course, the lag table may be directly 
initialized by the user, but it is not convenient in practice. So, the RANMAR lag 
table is initialized by stand-alone procedure RMARIN, proposed by James [23]. In 
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call 11 

// random number in r200 



func 10 // RanMar initialization 

dcl_literal 1200, %RMSEED_MASK% , %RMSEED_START% , 
dcl_literal 1201, 0.0, 1.0, 0.0, 0.0 
dcl_literal 1202, %RM_CD%, %RM_CM%, 0, 
dcl_literal 1203, %RMSEED_ALL%, %RMSEED_ALL%, 

%RMSEED_97%, %RMSEED_97-1% 
land r201.x,vaTid.x, 1200.x 
umad r202.xy, r201 .XX, 1203. xy, 1200 .yy 
iadd r203.xy,r202.xy,1203.zw 
raov r204.xy,g[r203.x] .xy 
raov r205,g[r203.x+l] 
iadd r206.xy,r202.xy,r204.xy 
ret_dyn 
endfunc 



func 11 // RanMar generation 

sub r207,g[r206.x] ,g[r206.y] 
It r208, r207, 1201 .xxxx 

cmov_logical r210 , r2 08 , 120 1 . yyyy , 1201 .x 

add r207, r207, r210 

mov g[r206.x] ,r207 

iadd r206.xy,r206.xy,1200.zz 

lit r211.xy,r206.xy,r202.xy 

cmov r206.xy,r211.xy,r203.yy 

sub r205,r205, 1202. xxxx 

It r212, r205, 1201 -xxxx 

cmov_logical r20 9, r212 , 1202 . yyyy , 1200. vi 

add r205, r205, r209 

sub r207, r207, r205 

It r209, r207, 1201 -xxxx 

cmov_logical r20 9, r2 09 , 1201 . yyyy , 1200. w 
add r207, r207, r209 
mov r200, r207 
ret_dyn 

func 12 // RanMar f inalization 

iadd g[r203.x] . xy, r20 6 . xy , r202 . xy_neg (xy 
mov g[r203.x+l] , r205 
ret dyn 

end 



Figure 5: Source code of RANMAR PRNG 



this procedure, the whole lag table is initialized on the base of only two given 5- 
digit integers, each set of which causes an independent sequence of the sufficient 
length for an entire calculation. The seed variables can have values between and 
31328 for the first variable and and 30081 for the second variable, respectively. 
RANMAR can create, therefore, 900 million independent subsequences for different 
initial seeds with each subsequence having a length of about 10'^° pseudo-random 
numbers. This approach considerably reduces the number of the possible generator 
states. Still it brings an important element of the generator features - easy division 
of sequences produced by generator among PRNG instances without overlapping. 
Possible sequences quantity (900 millions) at existing or developing hardware is 
considered to be sufficient even in medium-term perspective. 
The generator consists of two parts: 

• kernel which produces the seed numbers on initial seed values; in fact this 
kernel is a replica of the RMARIN subroutine of the James' version |23| : 

• subroutines which directly produce the random numbers. 

First part is executing only for initialization. 

The floating-point version of the RANMAR generator is realized here which pro- 
duces the pseudo-random directly in the interval [0; 1). So, it is not needed to use 
slowest integer-to-floating point converting operation. 

Apart from 97 elements in the lag table each RANMAR instance must store 
previous value of arithmetic sequence and two indices which are connected to each 
other. As in the case of GGL PRNG, every GPU thread produces four indepen- 
dent sequences in the presented implementation of RANMAR. Obviously at such 
approach, it is necessary to keep only one pair of indices for all four subthreads, 
because these subthreads are executed out synchronously. So, it requires 2.5 mem- 
ory cells be read and written for one PRNG pass. The size of RANMAR lag table 
for 4096-thread run is about 6MB. Initially lag table is prepared by stand-alone 
procedure RMARIN which is not shown here. 

The ATI IL source code of RANMAR PRNG is presented in Figure O The 
following variables are used 

RM_CD = 7654321.0/16777216.0, RM_CM = 16777213.0/16777216.0, 
RMSEED_97 = 97, RMSEED_97 - 1 = 96, 
RMSEED_ALL = 99, 
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RMSEED START is the offset of the lag table in global buffer, RMSEED MASK 
specifies the PRNG instance. 

3.6 RANLUX 

Nowadays RANLUX PRNG is one of the standard high-performed generators for 
Monte-Carlo simulations. The statistical properties of the generator are well-known. 
From the realization point of view, distinctive feature of the generator is the neces- 
sity to discard out groups of generated pseudo-random numbers after one generation 
cycle. Omitted values quantity is determined by the "luxury" parameter. While im- 
plementation of the algorithm on the central processing unit such discarding is 
"virtual", because lag table fits in processor cache very well, as a rule. Still this 
algorithm phase is very resource-intensive on GPU, because global buffer is not a 
generally cached object. Thus it is obvious, that the PRNG performance is strongly 
depend on this phase of algorithm realization. 

To implement the RANLUX generator, we use three quite different approaches. 
First of all, the direct translation of algorithm is performed. In this scheme all the 
seed values are updated directly in GPU global memory. Every thread in point 
of fact generates simultaneously the four independent sequences of the pseudo- 
random numbers. Luxury is performed for all four subthreads at the same time. 
This approach makes the algorithm considerably simpler, but does not allow getting 
the best performance. 

Next evident approach is to use the indexed temporary array for luxury opera- 
tion. It allows to make process execution much faster: at luxury level=3 3.5 times 
faster and 5 times faster at luxury level=4, keeping the complexity of the algorithm 
meanwhile. 

Third approach which really allows to make algorithm faster and in practice 
minimize the dependence of the execution time on luxury level, is the "planar" 
scheme. The geometry of the lag table is taken into account as much as possible in 
this scheme. For RANLUX it consists of the 24 elements which naturally could be 
grouped into six 4-component 32-bit registers. The new difficulty to vectorize the 
RANLUX algorithm is arisen due to the necessity of the recursive calculation of the 
carry bit c„ which depends on the preceding states of the generator. Therefore, some 
serial operations are appeared in planar RANLUX code as well as in presented here 
XOR128 implementation. Planar RANLUX procedure produces four pseudo-random 
numbers from one sequence for one pass of the generator. 

Base algorithm RANLUX requires discarding 24, 73, 199 and 365 values after 
one RANLUX cycle for luxury level 1, 2, 3 and 4, respectively. However to perform 
luxury in the planar implementation of the RANLUX, it is convenient to discard 
some larger number than ones, proposed by Liischer |30| . Strictly speaking, it 
is convenient to discard the number of the values which are multiply by 24. In 
CPU implementation such approach seems to be redundant (see [30]), but on GPU 
it shows better performance. Thus, in planar scheme the following numbers are 
discarded: 24, 96, 216 and 384 (for luxury levels 1, 2, 3 and 4, respectively). 

The ATI IL source code of planar RANLUX PRNG is presented in Figure[6l The 
following variables are used 

RLSEED_ALL_4 ^ 7, RLTWOM24 = 2^24^ 

RLSEED_24_4 = 24/4 = 6, RLSEED_24_4 -1 = 5, 

RLSEED START is the offset of the lag table in global buffer, RLNSKIP is the num- 
ber of generated values to be discarded (is defined by the luxury level), RLSEED - 
MASK specifies the PRNG instance. 

Due to relatively large lag table, RANLUX as well as RAN MAR generator requires 
the stand-alone initializing procedure. This procedure is running only once and does 
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cs_2_0 
all 10 



call 11 

// random nuntber in c200 



unc 10 // RanLux i 

del literal 1200, 



%RLSEED_MASK%, %RLSEED_ALL_4%, 
%RL SEE D_S TART % , -2 4 

%RLNSKIP%, %RLTW0M24%, %RLSEED_2 4_4 % 
%RLSEED 24 4-1% 



dcl_ 
dcl_ 
dcl_ 
and 



.iteral 1202, 19, 5 
.iteral 1203, 3, 1, 
.iteral 1204, 0.0, 
■201. x,vaTid.x, 1200 



20, 



umad r201.x,r201.x, 1200. y, 1200. z 
iadd r202.xy,r201.xx,1201.zw 
f e n c e_inemo r y 
mov r203,g[r202.x] 
ushr r204.xyz,r203.xyz,1202.ww0 
iadd r205.xyz,r201.xxO,r204.xYz 
mov r206,g[r205.y] 
ret_dyn 
ndfunc 

unc 11 // RanLux generation 

if_logicalz r205.z 
mov r207.x, 1201.x 
if_logicalnz r207.x 
mov r208,g[r201.x ] 
mov r209,g[r201.x+l] 
mov r210,r206 
mov r211,g[r201.x+3] 
mov r212,g[r201.x+4] 
mov r213,g[r201.x+5] 
whileloop 

lit r207 ._y, 1204 . z, r207 -x 

break logical z r2 07 . y 

mov r214,r213 

mov r215, r209 

call 13 

mov r213.xyz' 

mov r214, r212 

mov r215, r208 

call 13 

mov r212.xyzi 

mov r214, r211 

mov r215,r213 

call 13 

mov r211 . xyzi 

mov r214,r210 

mov r215, r212 

call 13 

mov r210 .xyz' 

mov r214, r209 

mov r215, r211 

call 13 

mov r20 9 . xyz' 

mov r214,r208 

mov r215,r210 

call 13 

mov r208 . xyzi 

iadd r207 .: 
endloop 
mov r206, r210 
mov r214, r213 
mov r215, r209 
call 13 



r216 . 



, r216 . 



, r216 . 



r216 . 



, r216 . 



, r216 .■ 
,r207 .X, 



mov r213 . xyzw, r21 5 . wzyx 
mov r200,r216 
mov r203. xyz, 1202. xyz 
ushr r204. xyz, r203. xyz, 1202. wwO 
iadd r205.xyz,r201.xx0,r204.xyz 
mov g[r201 .X ] ,r208 
mov g[r201 .x+1] , r209 
mov g[r201 .x+2] , r210 
mov g[r201 .x+3] , r211 
mov g[r201 .x+4] , r212 
mov g[r201 .x+5] , r213 

mov r214,g[r205.x] 

iadd r217. xyz, r205. xyz, 1203. zzw 

ige r218 .xy, r217 .xy, r201 .XX 

cmov_logical r217 . xy, r2 18 . xy, r2 17 . xy, r202 . yy 
mov r215,g[r217.y] 
call 13 

mov g[r205.x] .xyzw, r216. wzyx 
mov r205,r217 
endif 
else 

mov r214,g[r205.x] 

iadd r217. xyz, r205. xyz, 1203. zzw 

ige r218.xy,r217.xy,r201.xx 

cmov_lDgical r217 . xy , r2 1 8 . xy , r217 . xy , r202 . yy 
mov r215,g[r217.y] 
call 13 

mov g[r205.x] .xyzw, r216. wzyx 
mov r205, r217 
endif 

mov r200,r216 
ret_dyn 

func 12 // RanLux f inalization 

fence memory 

iadd r203 .xyz, r2 05 .xyz, r2 01 . xxO_neg (xy) 

ishl r203. xyz, r203. xyz, 1202. wwO 

iadd r203.xy,r203.xy,1203.xy 

mov g[r201 .x+6] , r203 

ret_dyn 

func 13 // RanLux update 

sub r216.xy,r206.yx,r214.wz 
sub r216.x,r216.x,r203.w 
It r219.x,r216.x, 1204.x 

cmov_logical r220 . x, r21 9 . x, 120 4 . y, 12 04 . x 
add r216.x,r216.x,r220.x 

cmov_logical r203. w, r2 1 9 . x, 120 1 . y, 12 04 . x 
sub r216._y,r216.y,r203.w 
It r219.x,r216.y, 1204.x 

cmov_logical r22 . x, r21 9 . x, 1204 . y, 12 04 . x 
add r216._y,r21S.y,r220.x 

cmov_logical r203. w, r2 1 9 . x, 120 1 . y, 12 04 . x 

sub r221.xy,r215.wz,r214.yx 
sub r221.x,r221.x,r203.w 
It r219.x,r221.x, 1204.x 

cmov_logical r220 . x, r21 9 . x, 120 4 . y, 12 04 . x 
add r221.x,r221.x,r220.x 

cmov_logical r203. ^w, r2 1 9 . x, 120 1 . y, 12 04 . x 

sub r221._y,r221.Y,r203.w 
It r219.x,r221.y, 1204.x 

cmov_logical r220 . x, r21 9 . x, 120 4 . y, 12 04 . x 
add r221._y,r221.Y,r220.x 

cmov_logical r203. w, r2 1 9 . x, 120 1 . y , 12 04 . x 

mov r216. z,r221.x 

mov r216.^ ^w,r221.y 

mov r206,r215 
ret_dyn 
endf unc 



Figure 6: Source code of RANLUX PRNG 
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not take much resources, so its listing is not presented here. Also, it may be realized 
in the host program with further copying the result to the GPU global buffer. 

Generator RAN LUX, as well as RAN MAR PRNG, possesses the important feature 
in the context of the GPU realization - the generator kernel is build on floating- 
point arithmetic. Each instance of the planar version of the RAN LUX requires 
storing seven 4-component cells only (three indices which are connected to each 
other, 24 items of lag table and carry bit). So for 4096-thread run the size of the lag 
table is 448kB (or 1664kB for the case of the global buffer and temporary indexed 
array RANLUX versions). 



3.7 Mersenne Twister 

The last PRNG which is implemented in this work is MT19937, one of the Mersenne 
Twister generators family. This generator seems to be very attractive nowadays due 
to its extremely long period and relatively easy algorithm. 

Mersenne Twister is incorporated in nVidia SDK as sample. The realized in 
nVidia SDK version of the Mersenne Twister contains 19 element lag table and 
period about 2^°''. It is easy to implement the MT19937 generator on the basis of 
this example by substituting the relevant parameters. 

In the presented realization of MT19937 we use the same scheme as in the 
planar implementation of RANLUX. Whole 624-element lag table is located into 156 
four-component cells. So, for the one pass of the PRNG it is easy to obtain four 
sequential pseudo-random numbers. But unfortunately, it is impossible to store 
whole lag table in the general-purpose registers to perform the lag table updating 
procedure, because the total number of the general-purpose registers allocated for 
one thread is 128. Thus, all operations with the lag table are realized with the slow 
direct access to the global buffer, what greatly reduces the total performance of 
the generator. Nevertheless, the using of the four-component elements somewhat 
compensates this disadvantage. 

The ATI IL source code of MT19937 PRNG is presented in Figure El The 
following variables are used 

MTSEED_624_4 = 156, MTSEED_ALL = 157, 

MTSEEDP2 = 9D2C5680i6, MTSEEDP3 = EFC6OOOO16, 

MTSEEDP4 = 0.5, MTSEEDP5 = 4294967296.0, 

MT_UPPER_MASK = 8OOOOOOO16, MT_LOWER_MASK = 7FFFFFFFi6, 
MT_MATRIX_A = 9908B0DFi6, 

MTSEED_START is the offset of the lag table in global buffer, MTSEED MASK 
specifies the PRNG instance. 

The lag table of the MT19937 is the biggest one among the lag tables of the 
presented generators. For 4096-thread run its size is about 10MB. MT19937 also 
requires the initialization procedure which prepares the lag table for the first run. 



4 Performance results and discussion 

For all PRNGs implemented here we use the MS Visual Studio 2008 Express edition 
(C++ compiler) [S3]. Original codes are presented in corresponding literature (see 
Section |X| and in several cases they are translated into C++. 

The CPU implementation of the PRNGs is constructed on the following common 
scheme: each PRNG is implemented as subroutine which produces only one pseudo- 
random number per call. Then the main program sums up all produced values 
and checks the elapsed time. The 10® pseudo-random numbers are used for this 
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cs_2_0 
all 10 



call 11 

// random nuinber in c200 



unc 10 // MT19937 initialization 

dcl_literal 1200, %MTSEED_MASK% , 



feni 



ory 



dcl_literal 1201, %MTSEED_ALL^ 
dcl_literal 1202, 11, 7, 15, : 
dcl_literal 1203, %MTSEEDP2%, 

%MTSEEDP5% 
dcl_literal 1204, 0, 1, 99, 100 
dcl_literal 1205, 56, 57, -1, 
dcl_literal 1206, 0, 155, 98, 99 
dcl_literal 1207, 0, %MT_UPPER_MASK% 

%MT_MaTRIX_A% 
and r201.x,vaTid.x, 1200.x 
umad r201.xy, r201 .XX, 1201 .XX, 1201 .yy 
mov r202 .X, g [r201 .x+%iMTSEED_e24_4%] 
ret_dyn 
endf unc 

func 11 // MT19937 generation 

ult r202.y,r202.x,1200.z 
call_logicalz r202.y, 13 
call 14 
ret_dyn 
endfunc 



1, %MTSEED_624_4%, 
%MTSEED START%, 155, 56 



.MTSEEDP3%, %MTSEEDP4%, 



%MT_LOWER_MASK% , 



func 12 // MT19937 finalization 

iadd g[r201.x+%iMTSEED_624_4% 
ret_dyn 
endfunc 



x,r202.x,1200.y 



func 13 // MT19937 

fence memory 
iadd r203, r201 .x: 
iadd r204.xy,r20 
mov r206,g[r203.: 
mov r208,g[r203. 
whileloop 

mov r205, r206 
mov r206, g [r20 
mov r207,r208 
mov r208,g[r20. 
mov r210 .xyz, 

mov r210. w 

land r211,r205 
land r212.xyz, 

land r212. 

lor r213,r211, 
land r214,r213 
ushr r209,r213 
cmov_logical 
ixor r209,r210 
ixor r209,r209 
mov g[r203 
iadd r203,r203 
uge r211.y, 
break logii 



update seeds 



X, 1204 
XX, 1201 . 



207 .yzw 

208. X 
, 1207 .yyyy 
r205. yzw, 1207. 
,r206.x,1207.z 
r212 

, 1200 .yyyy 
,12 00 .yyyy 
214, r214, 1207. 
, r209 
,r214 
•209 

, 1200 .yyyy 
.,r204 .y 
■211 .y 



iadd r203, r201 .X 

whileloop 

mov r205,r206 
mov r206,g[r20 
mov r207,r208 
mov r208,g[r20 
mov r210.xy; 

mov r210. 

land r211,r205 
land r212.xyz, 

land r212 . w 

ior r213,r211, 
land r214, r213 
ushr r209,r213 
cmQv_logical 
ixor r209,r210 
ixor r209,r209 
mov g[r203 
iadd r203,r203 
uge r211.y,r20 
break_logicaln 

endloop 

fence_memory 

iadd r203, r201 .X 
mov r205,r206 
mov r206,g[r20 
mov r207,r208 
mov r210.xy; 

mov r210. 

land r211,r205 
land r212 . xyz , 

land r212 . w 

ior r213, r211, 
land r214, r213 
ushr r209,r213 
cmov_logical 
ixor r209,r210 
ixor r209, r209 
mov g[r203.y], 

mov r202.x,1200. 



207. yzw 
r208.x 

1207 .yyyy 
r205. yzw, 1207. 
,r206.x,1207.z 
r212 
1200 .yyyy 
1200 .yyyy 
214, r214, 1207. 
■209 
214 
r209 
1200 .yyyy 
.X, r204 .X 
r211.y 



207. yzw 
g[r203.w] .x 

1207 .yyyy 
r205. yzw, 1207. 
,r206.x,1207.z 
r212 
1200 .yyyy 
1200 .yyyy 
214, r214, 1207 . 
■209 
214 
r209 



endfu 

func 14 // MT19937 generate rnd() 

iadd r201.y,r201.x,r202.x 

mov r200,g[r201.y] 
ushr r215,r200,1202.xxxx 
ixor r200, r200, r215 
ishl r215,r200, 1202. yyyy 
land r215,r215,1203.xxxx 
ixor r200,r200,r215 
ishl r215,r200,1202.zzzz 
land r215,r215, 1203. yyyy 
ixor r200,r200,r215 
ushr r215,r200,1202.wwww 
ixor r200, r200, r215 
utof r200,r200 
add r200, r200, 1203 . zzzz 
div r20D, r200, 1203 .wwww 

endfunc 
end 



Figure 7: Source code of MT19937 PRNG 
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Samples 

per 
second 



I Intel Core 2 Quad CPU Q6600 @ 2.40GHz, 4GB RAM (DDR2 5-5-5-16) 
I Intel Celeron CPU 420 @ 1.60GHz, 1.5GB RAM (DDR2 5-5-5-15) 



2.0E+07 




1.5E+07 



1.0E+07 



5.0E+06 



0.0 



Figure 8: Performance results for some PRNGs on two PCs 



procedure. Elapsed time is averaged over several single thread executions and the 
mean CPU performance is found. For the reference PCs we use two machines: 

• Intel Core 2 Quad CPU Q6600 @ 2.40GHz (LI cache 4x32KB, L2 cache 2x4MB), 
4GB RAM DDR2 400MHz Dual Symmetric (5-5-5-16) Command rate 2T; 

• Intel Celeron CPU 420 @ 1.60GHz (LI cache 32KB, L2 cache 512KB), 
1.5GB RAM DDR2 333MHz Dual (5-5-5-15) Command rate IT, 

which correspond to the middle-level and entry-level computers, respectively. 

The CPU performance results are presented in Figure [5] and Table [H It could 
be seen that the differences among the performances of the generators are about 
5-6 times. XOR128 and RAN MAR show the best performance results. Under CPU 
performance here and after we mean the performance of the one-thread instance of 
the algorithm. Of course, one-thread run does not provide maximal system utiliza- 
tion, however it allows comparing the potential performances of the systems. To 
show impartial assessment we can just multiply CPU performance by the quantity 
of the threads, supported up by peculiar processor, because it is always possible 
to run several PRNG instances to produce different independent pseudo-random 
sequences. 

All the PRNGs implementations on GPU are carried out in ATI Intermediate 
Language [31] (ATI Catalyst 10.1 display driver is used [52]). The host environment 
is also reahzed in MS Visual Studio 2008 C++ [63]. All used here CPUs are the 
main GPU devices installed in the system (i.e. they also provide visualization for the 
operational system) which lowers down the maximal performance of the system, but 
reflects more precisely the usual configuration of the GPU computational system. 
To obtain the performance of each PRNG, we run them up to 1000 times each to 
produce 4 x 10^ pseudo-random samples on every pass. Each produced pseudo- 
random number is stored in global buffer. Elapsed time is averaged over several 
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Table 2: The performance results of the presented here PRNG implementations on 
different ATI CPUs (here CPU is Intel Core 2 Quad CPU Q6600 at 2.40GHz and 
CPUs is Intel Celeron CPU 420 at 1.60GHz) 



PRNG 


X 10'' per second 


X 10^ per second 


Speed-up 
factor 


5850 


4870 


4850 


CPU 


CPU2 


GGL 


8.37 


5.05 


4.21 


1.23 


0.80 


681 


XOR128 


8.45 


6.29 


4.52 


1.86 


1.20 


455 


RANECU 


4.98 


3.32 


2.66 


1.21 


0.79 


411 


RANLUX3P 


1.08 


1.02 


0.63 


0.50 


0.33 


216 


RANLUX4P 


1.02 


0.86 


0.58 


0.32 


0.21 


322 


MT19937 


0.50 


0.62 


0.36 


1.07 


0.69 


47 


RANMAR 


0.18 


0.23 


0.14 


1.69 


1.10 


11 



executions. The time spending to copy the initial input seeds to GPU memory and 
final mapping the GPU memory into host memory is not taken into account. 

The performance results are collected in Figure [9] and Table [2] The first column 
of the Table [2] contains the name of the implemented generator. The number pseudo- 
random numbers which could be produced by corresponding PRNG per one second 
on corresponding GPU are show in the columns 2-4. Here HD5850, HD4870 and 
HD4850 are the ATI Radeon video cards. The next two columns contain the same 
information obtained on two mentioned CPUs. The last column shows the speed-up 
factor of the using the ATI Radeon HD5850 in compare with the using of the Intel 
Core 2 Quad CPU Q6600 at 2.40GHz. 

Memory access is a bottleneck of the GPU-applications. It is confirmed once 
again by different PRNG performance results on different ATI GPU hardware. The 
PRNGs with the greater number of the memory operations demonstrate the worst 
performance results. 

In the present realizations GGL and XOR128 generators require to keep only 
one 4-component seed-value per thread. This directly influences their work - both 
generators showed the best productivity. Despite the generator XOR128 in the 
present code has sequential part which slightly lowers the speed of the generator 
operation its performance turned out to be the highest. It may be explained by 
the fact that only bit operations are used in the arithmetic kernel of the XOR128 
generator which along with floating-point operations are the fastest implemented on 
ATI hardware. Generator GGL is realized on the base of the integer scheme of Park 
and Miller [12] which slightly brings down its output in compare with XOR128, in 
spite of less quantity of the intermediate operations. 

The RANECU generator already requires keeping two seed-values for its opera- 
tion. Along with using integer arithmetical operations in RANECU kernel it some- 
what lowers its performance in compare with GGL and XOR128 generators. The 
algorithm simplicity, the performance up to 5 x 10^ samples per second and the 
considerably better statistical properties allow extensively using RANECU genera- 
tor on GPU. In order to increase the RANECU period it is reasonable to employ the 
MRG based on the combinations not two but three MLCGs. However this extension 
requires the additional study of the new generator statistical properties. It may be 
a subject of the further research in this field. 

The RAN LUX generator shows considerably high performance on GPU. To a 
significant degree it may be explained by fortunate matching of GPU architecture 
and the generator parameters. It is managed to minimize generator dependence 
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ATI Radeon HD 5850 (RV870PRO, Core: 725MHz, Memory: 1000MHz, GDDR5) 
ATI Radeon HD 4870 (RV770XT Core: 750MHz, Memory: 900MHz, GDDR5) 
ATI Radeon HD 4850 (RV770PRO, Core: 625MHz, Memory: 993MHz, GDDR3) 




Figure 9: Performance results for some PRNGs on different GPUs 



on decorrelating luxury procedure by the actual organization of the virtual cache. 
Thereby it is possible to use maximal luxury level with the minimal performance 
penalty in practical applications. In the opposition to CPU realization where the 
difference between the performances at luxury level=0 and luxury level=4 reaches 
6 and more times, on GPU this parameter is only 10-23% (for different hardware). 

It is obviously that performance advantage also depends on the way of algorithm 
realization. In order to demonstrate this fact, we showed in Table[3]the performance 
values of different realization of the same algorithm for RANLUX generator for a 
number of GPUs. RANLUX implementation through the global buffer is the slowest 
one. In fact the present realization of algorithm entirely reproduces the dependence 
of the GPU realization on luxury level. Acceleration in compare with CPU is 
achieved only because of such GPU parameters as number of stream cores and 
engine clock rate. An unexpected result was the fact that realization of for global 
buffer realization of RANLUX HD4870 GPU shows better performance than HD5850. 
It is closely related to the fact that memory data access in the kernel is organized not 
in the best way and a lot of time is spent to synchronize memory access operations. 
Obtaining better performance is possible by lowering the instructions density which 
refer to the global buffer. 

Second RANLUX realization, through the indexed temporary array, is a bit faster 
than the first one. But the same situation with HD4870 and HD5850 GPUs per- 
formances can be observed here as well. The present realization has sense only 
for luxury levels 2-4, when time share, which is spent on indexed temporary ar- 
ray preparation, is compensated by luxury operation saving time itself. The main 
strong feature of this implementation is the complete correspondence to the classic 
algorithm, published in |31| . 

The last presented implementation of RANLUX (which is called planar here) 
possesses the best performance due to maximal reduction of memory access opera- 
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Table 3: Performance results of the different implementations of RANLUX generator 
on some ATI CPUs (ATI Radeon HD5850, HD4870, HD4850), xlO^ pseudo random 
numbers per second 



luxury 
level 


Planar 


Temporary array 


Global buffer 


5850 


4870 


4850 


5850 


4870 


4850 


5850 


4870 


4850 





11,35 


10,64 


6,54 


4,07 


5,56 


3,35 


5,02 


6,92 


4,32 


1 


11,00 


10,14 


6,19 


4,05 


5,39 


3,33 


3,50 


4,66 


2,71 


2 


11,01 


10,17 


6,20 


4,02 


5,00 


3,23 


2,18 


2,61 


1,55 


3 


10,85 


10,17 


6,27 


3,85 


4,18 


2,83 


1,11 


1,20 


0,72 


4 


10,24 


8,62 


5,77 


3,51 


3,44 


2,42 


0,68 


0,71 


0,43 



tions quantity. General advantage of performance on GPU in compare with CPU 
is up to 322 times (highest luxury level = 4). The modified algorithm is used here 
(see section I3.6P at which a bit higher quantity of pseudo-random numbers after 
one PRNG cycle are discarded away for decorrelation of lag table elements, than it 
is set up in the classic algorithm. 

In the first two RANLUX realizations of the algorithm, it requires to read four 
memory cells (not accounting for the luxury procedure operations) and write three 
cells (two lag table items, carry bit and indices) to obtain one pseudo-random 
number (updated lag table item, new carry bit and new indices). In planar scheme 
the carry bit occupies one of the four-component cells with the indices, so it needs 
less quantity of reading and writing operations (four read and two writes). 

The performance of the shown code of the MT19937 generator is turned out to 
be rather low. The main factor is the size of the lag table, the update of which is 
required after each PRNG cycle. Planar scheme applied here allows increasing the 
generator performance roughly in four times in compare with the direct four-threads 
realization, but the quantity of the memory access operations remains rather high 
which does not allow to achieve acceptable results. It is possible to reduce the lag ta- 
ble size by choosing another generator from the Mersenne Twister family. But from 
one hand, MT19937 generator parameters allow to realize the planar scheme (proved 
to be a good here), and from another hand there is a task to build the analogue 
of the actually used generators on GPU while algorithms realization. Although to 
generate one pseudo-random number it is needed only two read operations and one 
write only, the lag table update procedure drastically decelerates the generator. 

The worst performance is shown by RAN MAR generator which differs by the 
large enough lag table and high memory access operation density (four reads and 
three writes). 

All presented generators realizations allow to produce at slight changes pseudo- 
random numbers with double precision either by using the pairs of the generated 
numbers or directly through the double precision conversion. The last method is not 
fully correct, because it considerable lowers the quantity of the possible realization 
of the obtained double precision number. 

Paying attention to the XOR128 performance one can asserts that LYEcuyer 
generator Seven-XORShift [^S] is seemed to be very promising for realization on 
GPU due to the similarity to the XOR128 structure except for the eight-element lag 
table. 

The periods of the XOR128, RANLUX, RAN MAR and MT19937 generators are 
considered to be unachievable in medium-term perspective even for GPU clusters. 
Meanwhile the period of the GGL generator can be exhausted in a split second 
even on one relatively old GPU. So, it is necessary to pay special attention to the 
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generator applicability borders while developing applications on GPU. 

Undoubtedly the presented algorithms realization on the platform independent 
level such as OpenCL will be very important, but it goes beyond the present work 
frame and is a task for future research. The obtained results of the generators per- 
formances make it interesting to use the GPU for the investigation of the generators 
statistical properties. 

5 Conclusions 

In the present paper the most popular uniform pseudo-number generators which 
are used in Monte Carlo simulations in high-energy physics are realized on GPU. 
The list of the modern software packages with the indication of the generators used 
is represented. A theoretical background for pseudo-random number generation 
is described. The source codes of the implemented generators (multiplicative lin- 
ear congruential (GGL), XOR-shift (XOR128), RANECU, RANMAR, RANLUX and 
Mersenne Twister (MT19937)) are showifl 

The comparative analysis of the PRNG performances on CPU and ATI GPU is 
presented. The obtained speed-up factor is hundreds of times higher as compare to 
CPU. Performance analysis of the mentioned generators with taking into account 
their statistical properties allows to conclude that the most appropriative generator 
for Monte Carlo simulations on GPU is planar RANLUX with luxury level 4. Offered 
planar scheme makes it possible to increase significantly the performance of the most 
generators by the reducing memory access operations. 

Offered PRNG program model (generator division into separate subroutines) 
also allows to obtain additional considerable gain of performance at the multiple 
calls of the PRNG generating subprogram. This is achieved by means of a substan- 
tial reduction in the number of the memory access operations per PRNG cycle. 

In the present paper the generator performances in different realizations are 
analyzed by way of RANLUX PRNG example. It is shown once again that the 
memory access operations are the bottleneck of CPUs. It is possible to conclude 
that for GPU implementations of PRNGs it is better to choose algorithms with the 
minimal lag table size (which to be advisable multiply by 4) perhaps in spite of 
some complication of algorithms. 

Potential of GPU implementation in Monte Carlo simulations is not fully realized 
nowadays. This is despite of large amount of works dedicated to this problem. GPU 
sector development dynamics makes it possible to predicate that this trend is one 
of the most promising nowadays. 
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A Basic classes of PRNGs 



In this section we briefly provide the theoretical background for the main classes 
of the pseudo-random number generators. There are numerous books and reviews 
about the subject, so the details could be found in the references. 

While pseudo-random numbers generation, there are only two internal sources 
of the randomness which could be used in the algorithm. They are the sequence 
itself (previous values in a sequence precisely) and the starting parameters (PRNG 
parameters and seed values). Actually, the algorithms of the PRNGs are distin- 
guished by the way of the using these sources. So, particular PRNG is a certain 
function / which produces the next value X„, 

j^PRNG _ /(X„_i, X„_2, • • • 7 Xn^r)- (2) 

Here and below we will use upper index PRNG to identify the sequence produced by 
particular generator PRNG. The maximum period of the generator is the length of 
the cyclic sequences produced by PRNG and is limited by the number of the states 
that can be represented by PRNG. 

First simple PRNGs uses only one previous value Xn-i {r = 1) for generation, 
but in such scheme the PRNG period is limited by the bit capacity of the machine. 
If one uses the longer tables of the sequence of the previous values, from one hand 
it makes the generator period longer and its statistics properties better, as well 
as allows to simplify transformation function / for getting better output of the 
generator. From the other hand, longer tables require more complicated generator 
initialization and reduce its portability (the strict control of the architecture is 
needed, for example, the size of the cache memory), and it is obviously necessary to 
have much more memory to store such tables. A good generator is always a golden 
middle between algorithm complexity, the statistical properties of the generator and 
the size of the seed table. 

According to the basic requirement for PRNG - repeatability, any starting state 
of the generator may cause only one specified sequence. The initialization of the 
seed table is often made by simpler generator which has lower requirements. Most 
of all linear congruential generators or their combination is used for initialization. 
So, it allows to set starting generator states by the limited set of the input seed 
values (often 1-2 numbers). But Marsaglia [3] drew our attention to the fact that 
the quantity of the starting states decreases drastically. For example, Mersenne 
Twister generator MT19937 uses 624-element seed table (which may contain about 
^232-^624 ^ lO^"^! different values) whenever 32-bit number is usually used for its 
initialization, for which only (2^^) ~ 4 x 10^ possible values are given. Certainly, 
algorithm contains the possibility of the vector initialization. Nevertheless, only one 
seed is used to simplify the initialization. 

The period of PRNG nearly always depends on its parameters and seed values. 
Thus, we always show only the upper bound for the value of the generator period. 

A.l Linear congruential generators 

Linear congruential generators (LCG) is one of the oldest and most popular class 
of the PRNGs, which is widely used in computations in particular due to the en- 
cyclopedic work of Knuth [11] . It is based on so-called linear congruential integer 
recursion, 

Xlf-^ = {aXn-i + c) mod m, (3) 

where increment c and modulus m are desired to be positive coprime integers (c < 
m) to provide a maximum period, multiplier a is an integer in the range [2; (m — 1)]. 
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If increment c = the LCG is often called the multiplicative linear congruential 
generator [MLCG], 

X^^^^ = aXn-i mod m. (4) 
The maximum period of LCG Plcg strongly depends on LCG parameters and 

acG<(TO-i). (5) 

Demonstrative situation with poor choice of LCG parameters happened with infa- 
mous generator RANDU - MLCG(a = + 3, 77i = 2^^), 

^RANDU ^ 65539X„_i mod 2147483648, (6) 

which suffers from three-point correlations among the sequential elements: 

xr'°^-6X„_i~9X„_2. (7) 

Another well-known LCG is the standard 48-bit generator DRAND48, which is 
a LCG(a = 25214903917, c = 11, m = 2^8) 

^DRAND48 ^ (25214903917X„_i + 11) mod 2^8. (8) 

Park and Miller propose [12] a portable minimal standard Lehmer generatoi0 [T3] 
known as prime modulus multiplicative linear congruential generator (PM MLCG). 
Sometimes it is also denoted by the acronyms RANO, CONG, SURAND or GGL. Park 
and Miller choose the Mersenne prime number 2^^^ — 1 as the modulus to, multiplier 
a = 7^ and increment c = 0, 

X^'^^ = 16807X„_i mod 2147483647. (9) 

The total period of the GGL is relatively short, Pggl = (2^^ - 1) - 1 = 2147483646. 

One of the main well-known problems of the LCG is that every LCG produces 
n-tuples of uniform variates which lie in at most parallel hyperplanes |14l [T5] . The 
other defects of the LCG are: 

• the dependence of the generator period on initial seed; 

• the influence of the chosen modulus on statistical properties of the pseudo- 
random sequence; 

• lowest bits are not random. 

And finally, the minimal integer value produced by MLCG is 1, not 0. 

A. 2 Feedback shift register generators 

In 1965 Tausworthe |18| introduced a new generator based on bit sequence 

^"^^^ = a,Xn-}j mod 2, (10) 

where = {0,1}, Xi = {0,1}, r < n, is called linear feedback shift register 
algorithm (LFSR). 

The period of the LFSR is the smallest positive integer P for which 

(Xr-l, ■ ■ • , Xq) = {Xp+r-1, ■ • • , Xp) . (11) 



^The original parameters of Lehmer generator are a = 23, m = 10* -|- 1 
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The next state could be obtained with the fohowing transformation 
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The characteristic polynomial of r x r transformation matrix A 
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(14) 



a{x) = det (A - xl) 



(15) 



where / is an identity matrix. 

According to the theory of the finite fields [19] the maximal period PpSRG = 
2*" — 1 is achieved if and only if the characteristic polynomial a{x) is an irreducible 
polynomial over Galois field GF{2). In other words, if and only if the smallest 
positive integer p: (x^ mod a{x)) mod 2 = 1 is p = 2'' — 1 [7]. 

For computational efficiency, most of the Ui in (|10p should be zero. In GF{2) 
there is only one irreducible binomial, x + 1, which would yield an unacceptable 
period [8]. Consequently, the trinomials are usually used to express the recurrence 
sequence ([T0| . 



X,„ 



{Xn-r + Xn-s) mod 2, s < r < n. 



(16) 



Addition modulo 2 for one-bit variables is ordinary binary exclusive-or operation 
XOR, so (fT6|) may be rewritten as 



Xn — Xn-r XOR Xj, 



(17) 



LFSR recursion produces pseudo-random bit sequence. To obtain fc-bit 
pseudo-random integers Yn from such recursion, one can group up k sequential bits. 



fe 



2*^ Xkn+j-l- 



(18) 



Such method is called the digital multistep method of Tausworthe [20] . 

Another method proposed by Lewis and Payne |3T] is the generalized feedback 
shift register ( GFSR) . In GFSR scheme bits in the positions j of the pseudo-random 
integer are filled with the copy of initial one-bit recursion p7[) which has a period 
2'' — 1 with some nonnegative offsets dj , 



(19) 
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Clearly, LFSR is the particular case of GFSR. 

For example of GFSR it could be mentioned the R250 generator [52]. It is 
another infamous PRNG which causes the severe problems in the Monte-Carlo 
simulations of the two-dimensional Ising model using the single-cluster Wolff update 
algorithm (see [9] and references therein). For R250 the GFSR parameters are 
r = 250 and s = 103, 

j^R250 _ Xn~250 XOR Xn-103- (20) 

The R250 period is Pr25o = 2^50 - 1 « 1.81 x lO'^^ 
A. 3 Lagged Fibonacci generators 

Lagged Fibonacci generators (LFG) is another class of the PRNGs. It is based on 
the well-known Fibonacci recurrence sequence 

+ ^„-2. (21) 

Because the simple Fibonacci generator is not very good (23j one always uses the 
generalized relation pip with respect to any given binary arithmetic operation 
and prehistory 

X\f^ = [Xn-r Xn-s) mod m, (22) 

where r and s are called "lags", r < n and 1 < s < r. 
The LFG period Plfg for different operations is 

r (2'' - l)m/2 for + or - 
Plfg < \ (2'' - l)m/8 for x . (23) 

I (2'' - 1) for XOR 

The main attractive features of the LFGs are long period, potential absence 
of conversion integer into float operations and simple recursive scheme which not 
requires the heavy mathematical operations. However, for generation LFGs it is 
needed to store r previous pseudo-random values. 

There are numerous possible pairs of the LFG lags [11]. The larger lags lead to 
the decreasing of the correlations between the numbers in the sequence. But even 
for relatively short lag table r > 20 it passes many statistical tests. 

RAN 3 generator [TT] could be mentioned as an example of the LFG. It was 
proposed by Mitchell and Moore (unpublished), with lags r = 55 and s = 24. 
m = 10^ and operation subtraction, 

X^^"""^ = (X„_55 - ^n-24) mod 10^ (24) 

The RAN3 period is Pran3 = (2'^^ - 1)10V2 w 1.8 x 10^^ 

A. 4 Combined generators 

Combined generators are a special class of the PRNGs, which contains the features 
of the different PRNG classes. There are two main motivations to use combined 
generators: 

• the period increasing of the generator, 

• the improving of the generator statistical properties. 
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A. 4.1 Multiple recursive generators 

The obvious extensions of tlie LCG is tlie multiple recursive generator {MRG) [ini 
[T7] which is determined as the combination of the MLCGs 

X^^^ = (aiX„_i + a2^n-2 + . . . + a/c^„-fe + c) mod m. (25) 

When A: > 1, MRG is usually called MRG of the k order. The maximal period 
of the MRG is Pmrg < m'' - 1. In fact LFG is the special case of the MRG (for 
multipliers = 1 and c ~ 0). 

By decomposition of the modulus of the MLCG into two terms m ~ aq + r and 
eqn.dH) may be written as following [TB] 

Xn = aXn-1 mod m = (a(X„_i mod q) — yXn^i/ q\r) mod m, (26) 

Xn = Xn + m for Xn < 0, 

where [a/6J denotes the integer part of the (a/b) and 

q ~ [m/aj, r = m mod a. (27) 

To provide the uniform distribution the combinations of I generators ([26|) may 
be combined as [16] 

Xn = (^^{-iy'^X,^}j mod (mi - 1), (28) 

Xn=Xn + (mi - 1) for Xn < 0. 

Here the new index j in Xj^n means the n-th value (|26p of j-th generator. 

One of the possible MRGs is the RAN ECU generator [16]. It is the combination 
of two MLCGs {I = 2) with ai = 40014, mi = 2147483563 and 03 = 40692, ma = 
2147483399. The RAN ECU period is Pranecu = (mi -1) (7712 - 1)/2 w 2.30584x10^^. 
MRGs have good statistics and pass most the tests. 



A.4.2 XORShift 

XORShift PRNG, proposed by Marsagha [25], is another member of the GFSR 
generators class. Let Xq be a some initial fc-bit row-state of XORShift and T is 
k X k nonsingular binary matrix which sets linear transformation. The n-th PRNG 
state may be derived through the following equation 

^XORShift ^ ^29) 

To ensure the performance requirements Marsaglia proposed the special form of 
matrix T, 

T = {I + L"") {I + R'') (I + L^) , (30) 

where matrices L and R are kxk binary matrices which effect shift of one to the left 
and right, correspondingly. So, if X„i is a fc-bit state then causes the new state 
L'^Xrn = (^m < «) as wch as {I + L") - the state (/-f i")X„ = X^ XOR (X^ < 
a). In [25] Marsaglia lists all possible full-period triplets (a, 6, c) for 32-bit (648 
combinations) and 64-bit (2200 combinations) XORShift PRNG. 
The maximal period of XORShift is 

/^XORShift < 2^= - 1. (31) 
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In spite of XORShift PRNG passes the DIEHARD Battery of Tests of Random- 
ness |25j L'Ecuyer appoints that it "spectacular failed" the SmallCrush and Crush 
tests 121]. L'Ecuyer does not recommend to use this class of the generators, but 
proposes the own version of the XORShift implementation - Seven-XORShift. 

Marsaglia gives an example of the XORShift generator for 128-bit vector with 
four 32-bit components - XOR128 PRNG [25], 



n — 3 



X 



ri-2 



Xn-1, Xj 



nXOR128 



/ 
0/0 
\ / 

or in the terms of the 32-bit components 



(X„_4, Xn~3, ^n-l) 

/ + \ 

' ^ ' 




t = (X„_4 XOR (X„_4 < 11)), 

Xn-3 ~ Xn~2, Xn-2 = Xn~l, Xn-1 = Xn, 

X„ ^ (X„_i XOR (X„_i > 19)) XOR {t XOR > 8)). 
The XOR128 period is PxORi28 < S^^s _ i. 



(32) 



(33) 



A. 4. 3 Mersenne twister 

One of the most "fashionable" modern PRNGs is the Twisted GFSR generator 
{TGFSR) or Mersenne twister generator |27l I28j. TGFSR is the modernization of 
the GFSR and its algorithm is based on the following recurrence for w-bit vectors 

Xn 



XT^FSR ^ x^^^ XOR (X^P"- OR X°7i") A, 



(34) 



where superscript indices "upper" and "lower" denote the w^u highest and u lowest 
bits of a corresponding binary vector Xi, respectively. 



^upper ^ - 2") 

slower ^ p^^^ - 1), 



(35) 



X\ 



and matrix A is a "twisting" binary w x w matrix, which form is chosen by perfor- 
mance reason 
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(36) 



1 

\ aw-1 aw-2 ato_3 • • ■ ao / 

So, only one lu-bit vector a ~ [a^-i, atD-2, • ■ • , ao) defines the product XiA, 



X,A 



{X, > 1) if {X, AND 1) = 

(A, > 1) XOR a if (A, AND 1) = 1 



(37) 



For improving the statistical properties of the sequence so-called tempering proce- 
dure is applied for the output sequence A„. This procedure is defined with the 



t = Xn XOR (A„ > to), 
t = t XOR {{t < d) AND &), 
t = t XOR {{t < e) AND c), 
t XOR {t > 0- 



(38) 



TGFSR 
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By appropriate choosing of r, u parameters and binary vector a, it might reach the 
maximal period of the TGFSR, 

Ptgfsr < (2™-"-l). (39) 

The most famous implementation of the TGFSR PRNG is MT19937 EH]- The 
TGFSR parameters of the MT19937 are the following: 

w = 32, r = 624, s = 397, u = 31, (40) 
a = 99O8BODF16 = IOOIIOOIOOOOIOOOIOIIOOOOIIOIIIII2 

and tempering parameters are 

m = 11, 1 = 18, (41) 
d = 7, 6 = 9D2C5680i6, 
e = 15, c = EFC6OOOO16. 

The maximal period of the MT19937 is 

PMT19937 < 2^24x32-31 _ ^ ^ 2^9937 _ 1 ^ 4.3 ^ 10«°"1. (42) 

A.4.4 RANMAR 

RAN MAR |231 is a combination of two generators, 24-bit lagged Fibonacci gen- 
erator LFG(97, 33, -) y„ with m = 2'^'^ = 16777216 and simple arithmetic sequence 
C„ for the prime modulus M = 2'^'^ - 3 = 16777213, 

^RANMAR ^ (^y^ _ ^_ (43) 

Or equivalently the producing recurrence is, 

Xr^^^^ = i;-C„, (44) 
X„ = X„ + m for Xn < 0. 

Here y„ and C„ are 

Yn = (Fri-97 - i^ri-33) Hiod 771, (45) 

Cn = {C„-i - D) mod M, 

where D = 7654321. 

The LFG(97, 33, -) period is Plfg(97,33,-) < (2^^-1)2^4/2 ~ 2^"^^ (see eqn 

Therefore, the total period of the RANMAR is 



and the period of the arithmetic sequence C„ is P\_cGti,2'^^) < (2^"' — 1) (see eqn.(IS] 



-Pranmar < ~ 2.23 X 10«. (46) 

A. 4. 5 Add-With-Carry and Subtract- With-Borrow generators 

In 1991 Marsaglia and Zaman introduced a new class of PRNGs: add-with-carry 
{AWC) and subtract- with-borrow {SWB) ^29j, which are small modifications of the 
LFG with respect to supplementing an extra carry or borrow bit. Due to the 
branching it became the first class of nonlinear PRNGs [S]. The AWC generator is 
described by the sequence 

X^^ = (X^^r + ^„-s + c„-i) mod m, (47) 

^ _ I 1 if {Xn-r + Xn-s + C„-l) > m 
I if {Xn-r + Xn-s + Cn-l) < m 
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In addition to r seed values generator must be initialized with the 

carry bit c^. The maximal period of the AWC generator is 



Pmc < m'- + m'' - 2. (48) 
In the SWB case the subsequent values of the sequence are obtained by 

XSWB ^ (x„_, _ Xn-s - c„-i) mod m, (49) 

^ f 1 if {Xn-r - Xn-s - C„_i) < 

I if (X„_,-X„_, -c„_i) >0 ■ 

L'Ecuyer noted |10) that SWB has a second variant, in which indices r and s in 
(j49| are swapped. The maximal period of the SWB generator is 

AwB < m'' - m' - 2. (50) 

The well-known example of the SWB is RCARRY [23] which underlies the RAN- 
LUX PRNG. The RCARRY parameters in (gS]) are to = 2^*, r = 24 and s = 10, 



^ I 1 if (X„_24 - -'^n-lO - C„_i) < 

" I if (X„_24 - Xn-io - c„_i) > ■ 
The RCARRY period is about [29] 

Prcarry < ((2'^)"^ - (224)1" _ 2) /48 ~ 1/3 x 2"^ 5.15 x IQi^i. (52) 



It is less than the maximal period of the SWB ([50)) because modulus m 
not a prime number. 



A.4.6 RANLUX 

Despite all advantages (extremely long period, portability and good productivity) 
AWC and SWB generators suffer from some statistical defects (a bad lattice struc- 
ture), showed by Liischer [3D]. To eliminate these lacks James [21] implements the 
Liischer's [30] idea to modify the SWB generator RCARRY. In the new generator 
which was called RANLUX (for LUXury RANdom numbers |3T]) after producing 
r = 24 pseudo-random values the p — r following sequential values are discarded. 
It might be used any values of p, but there are five generally accepted levels of the 
luxury every of which has its own value p |301 131j , 

• level {p = 24): complete equivalent to original RCARRY generator, there 
are no discarding values 

• level 1 (p = 48): throws out 24 values after one generation cycle; consider- 
able improvement in quality over RCARRY, passes the gap test, but still fails 
spectral test 

• level 2 (p = 97): passes all known tests, but theoretically still defective 

• level 3 {p ~ 223): default level, any theoretically possible correlations have 
a very small chance of being observed 

• level 4 {p = 389): highest possible luxury, all 24 bits of the mantissa are 
chaotic. 

Liischer recommends |30j to use a default value p = 223 and notes that employment 
of values p > 389 is pointless. 



27 



References 



"Top 500 list of supercomputers", http://top500.org/ 

P. Coggington, "Random Number Generators for Parallel Computers", The 
NHSE Review (1996). 

G. Marsaglia. "Random Number Generators", J. Mod. Appl. Stat. Methods 2 
No.l (2003) 2. 

D. Thomas and W. Luk, "Uniform Generators for GPUs", 
http://www.doc.ic.ac.uk/ dt 10/ research/ rngs-gpu-uniform.htm I 

W. Langdon, "A fast high quality pseudo random number generator for nVidia 
GUDA", GECCO '09 Proceedings (2009) 2511. 

I. Vattulainen, "New tests of random numbers for simulations in physi- 
cal systems". Licentiate Thesis, Tampere University of Technology (1994); 
arXiv:9411062 [cond-mat]. 

P. L'Ecuyer, "Random Number Generation", chapter 2 of the "Handbook of 
Computational Statistics: Concepts and Methods", J. Gentle, W. Hardle and 
Y. Mori (eds). Springer- Verlag (2004) 1070. 

J. Gentle, "Random Number Generation and Monte Carlo Methods", 2nd ed., 
New York: Springer (2003) 381. 

W. Janke, "Pseudo Random Numbers: Generation and Quality Checks", 
Quantum Simulations of Complex Many-Body Systems: From Theory to Al- 
gorithms, Lecture Notes, J. Grotendorst, D. Marx, A. Muramatsu (Eds.), 
John von Neumann Institute for Computing, Jiilich, NIC Series, Vol. 10 (2002) 
447. 

P. L'Ecuyer, "Uniform Random Number Generation", Annals of Operations 
Research 53 (1994) 77. 

D. Knuth, "The art of computer programming". Vol. 2, "Seminumerical algo- 
rithms", 3rd ed. (1997) 762. 

S. Park and K. Miller, "Randoms Number Generators: Good Ones are Hard 
to Find", Commun. of the ACM 31 Number 10 (1988) 1192. 

D. Lehmer, "Mathematical methods in large-scale computing units", Annu. 
Comput. Lab. Harvard Univ. 26 (1951) 141. 

G. Marsaglia, "Random Numbers Fall Mainly in the Planes", PNAS Vol.61, 
No. 1 (1968) 25. 

G. Marsaglia, "The structure of linear congruential sequences". In "Applica- 
tions of Number Theory to Numerical Analysis", ed. S. Zaremba, Academic 
Press, New York (1972) 248. 

P. L'Ecuyer, "Efficient and Portable Combined Random Number Generators", 
Commun. of the ACM 31 Number 6 (1988) 742. 

P. L'Ecuyer, "Random Numbers for Simulation", Commun. of the ACM 33 
Number 10 (1990) 85. 

R. Tausworthe, "Random Numbers Generated by linear Recurrence Modulo 
Two", Math. Comp. 19 (1965) 201. 



28 



[19] R. Lidl and H. Niederreiter. "Introduction to Finite Fields and Their Appli- 
cations", Cambridge University Press (1986) 407. 

[20] H. Niederreiter, "Random number generation and quasi-Monte Carlo meth- 
ods", SIAM (1992) 241. 

[21] T. Lewis and W. Payne, "Generalized Feedback Shift Register Pseudorandom 
Number Algorithm", J. of ACM 20 (1973) 456. 

[22] S. Kirkpatrick and E. StoU, "A Very Fast Shift- Register Sequence Random 
Number Generator", J. of Comput. Phys. 40 (1981) 517. 

[23] F. James, "A Review of Pseudorandom Number Generators", Comput. Phys. 
Commun. 60 (1990) 329. 

[24] G. Marsaglia and A. Zaman, "Toward a Universal Random Number Genera- 
tor", Florida State University Report FSU-SCRI-87-50 (1987). 

[25] G. Marsaglia, "Xorshift RNGs", J. of Stat. Soft. 8 (2003) 1. 

[26] F. Panneton and P. L'Ecuyer, "On the xorshift random number generators", 
ACM TOMACS 15 issue 4 (2005) 346. 

[27] M. Matsumoto and Y. Kurita, "Twisted GFSR generators", ACM TOMACS 
2 (1992) 179. 

[28] M. Matsumoto and T. Nishimura, "Mersenne twister: a 623-dimensionally 
equidistributed uniform pseudo-random number generator", ACM TOMACS 
8 (1998) 3. 

[29] G. Marsaglia and A. Zaman, "A New Class of Random Number Generators", 
Ann. Appl. Prob. 1 (1991) 462. 

[30] M. Luscher, "A Portable high quality random number generator for lat- 
tice field theory simulations", Comput. Phys. Commun. 79 (1994) 100 
|arXiv:hep-lat/930902^ . 

[31] F. James, "RANLUX: A FORTRAN implementation of the high quality pseu- 
dorandom number generator of Luscher", Comput. Phys. Commun. 79 (1994) 
111 [Erratum-ibid. 97 (1996) 357]. 

[32] M. Di Pierr o and J. M. Flynn, "Latt ice QFT with FermiQCD", PoS LAT2005, 
104 (2006) |arXiv:hep-lat/0509058 ||. 

[33] FermiQCD, http://web2py.com/fermiqcd/ 

[34] M. Di Pierro, "From Monte Carlo integration to lattice quantum chromody- 
namics: An introduction", |arXiv:hep-lat / 00090dl[ 

[35] MILC Collaboration, http://www. physics. utah.edu/^detar/milc/ 

[36] Columbia Physics System, http://qcdoc.phys.columbia.edu/cps.html 

[37] SZIN Software System, http://www.jlab.org/~edwards/szin/ 

[38] F. E. Paige, S. D. Protopopescu, H. Baer and X. Tata, "IS A JET 7.69: 
A Monte Carlo event generator for p p, anti-p p, and e+ e- reactions", 
arXiv:hep-ph/0312045| 

[39] ISAJet Monte Carlo Event Generator, http://www.hep.fsu.edu/~isajet/ 



29 



[40] Geant4 toolkit, http://geant4.web.cern.ch/geant4/ 

[41] HEPRandom module, 

http:// proj-cl hep. web. cern.ch/proj-cl hep/ma nual/UserGuide/ Random/ Random, htm I 

[42] T. Sjostrand, S. Mrenna and P. Z. Skands, "PYTHIA 64 Physics and Man- 
ual", JHEP 0605, 026 (2006) | arXiv:hep-ph/0603175 |. 

[43] PYTHIA event generator, http://home.thep.lu.se/~torbjorn/Pythia.html 

[44] G. Corcella et al, "HERWIG 6.5: an event generator for Hadron Emission Re- 
actions With Interfering Gluons (including supersymmetric processes)", JHEP 
0101, 010 (2001) |arXiv:hep-pli/0011363| . 

[45] HERWIG package, http://hepwww.rl.ac.uk/theory/seymour/herwig/ 

[46] A. Pukhov et al., "CompHEP: A package for evaluation of Feynman diagrams 
and integration over multi-particle phase space. User's manual for version 
33", arXiv:hep-ph/9908288 , 

[47] CompHEP package, http://comphep.sinp.msu.ru/ 

[48] S. Frixione and B. R. Webber, "Matching NLO QCD computations and parton 
shower simulations", JHEP 0206, 029 (2002) |arXiv:hep-ph/0204244|. 

[49] MCQNLO package, http://www.hep.phycam.ac.uk/theory/webber/l\/lCatNLO/ 

[50] T. Gleisberg, S. Heche, F. Krauss, A. Schalicke, S. Schumann and J. C. Wm- 
ter, "SHERPA 1. alpha, a proof- of- concept version", JHEP 0402, 056 (2004) 
|arXiv:hep-ph/0311263| . 

[51] SHERPA package, http://projects.hepforge.org/sherpa/dokuwiki/doku.php 

[52] R. G. Edwards and B. Joe [SciDAC Collaboration and LHPC Collaboration 
and UKQCD Collaboration], "The Chroma software system for lattice QCD", 
Nucl. Phys. Proc. Suppl. 140 (2005) 832 |arXiv:hep-lat70409003] . 

[53] http:// usqcd.jlab.org/usqcd-docs/chroma/ 

[54] C. Andreopoulos et al., "The GENIE Neutrino Monte Carlo Generator", 
larXiv:0905.25T7l [hep-ph]. 

[55] GENIE neutrino MC generator, http://www.genie-mc.org/ 

[56] M. L. Mangano, M. Moretti, F. Piccinini, R. Pittau and A. D. Polosa, 
"ALPGEN, a generator for hard multiparton processes in hadronic collisions", 
JHEP 0307, 001 (2003) | arXiv:hep-ph/0206293 |. 

[57] ALPGEN package, http://mlm.home.cern.ch/mlm/alpgen/ 

[58] V. Demchik and A. Strelchenko, "Monte Carlo simulations on Graphics Pro- 
cessing Units", la rXiv :0903.3053 [hep-lat]. 

[59] AMD Intermediate Language (IL) Specification (v2), 
|http://developer.amd.com/gpu/ATIStreamSDK/assets/| 
ATI _ I ntermediate_ La nguage_( I L) _Specification_v2.pdf 

[60] Comparison of ATI CPUs, 

jhttp: //en. wikipedia.org/wiki/Comparison_of_ATI_graphics_processing_ units 

[61] ATI Stream SDK, 

http://developer.amd.com/gpu/ATIStreamSDK/ 



30 



[62] ATI Catalyst Display Driver, 

://ati. a md.com/support/driver. html 

[63] Microsoft Visual C++ 2008 Express Edition, 
|http: //www, microsoft.com/express/down load/, 



31 



